In [1]:
# Allow us to load `open_cp` without installing
import sys, os.path
sys.path.insert(0, os.path.abspath(".."))
Having discovered further trouble replicating the results of SaTScan, we introduce some more support for reading and writing SaTScan files, and test various corner cases.
The class AbstractSTScan
works with "generic time": plain numbers, interpreted as some unit of time before an epoch. This allows us to concentrate on the details of the algorithm. We also introduce a more complicated rule for cases when the boundary of a disc contains more than one event (see below).
The class STScanNumpy
takes the same data and settings as AbstractSTScan
, but uses a vectorised numpy
programming style to improve performance. Like the original implementation, and unlike AbstractSTScan
, it does nothing special with events which fall on the boundary of a disc.
In the following, we set the "population" limits to 50% (the SaTScan default) and set the maximum radius and time lengths to be effectively infinity, given the inputs. Hence these results should be directly comparable with the results from SaTScan for a "Prospective Analyses" / "Space-Time" and "Space-Time Permutation" with otherwise default options.
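As an illustrative sketch (the helper below is our own, not part of open_cp), the geographic population limit simply discards candidate discs which contain more than the given fraction of all events:

```python
import numpy as np

def radii_within_population_limit(coords, centre, radii, limit=0.5):
    # Squared distance from each event to the candidate disc centre.
    distsq = np.sum((coords - np.asarray(centre)[:, None])**2, axis=0)
    max_count = limit * coords.shape[1]
    # Keep a radius only if its disc holds at most `limit` of all events.
    return [r for r in radii if np.sum(distsq <= r * r) <= max_count]

np.random.seed(1)
coords = np.random.random(size=(2, 100)) * 100
# A radius of 200 covers every event in [0,100]^2, so it exceeds the 50% limit.
kept = radii_within_population_limit(coords, [50, 50], [10, 30, 200])
```

The real search additionally enumerates disc centres at event locations and applies the analogous 50% limit to time intervals.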
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import open_cp.stscan, open_cp.stscan2
import numpy as np
In [3]:
def make_random_data(s=100):
    times = np.floor(np.random.random(size=s) * 200)
    times.sort()
    times = np.flipud(times)
    coords = np.random.random(size=(2,s)) * 100
    return coords, times

def build_ab_scan(coords, times):
    ab_scan = open_cp.stscan2.AbstractSTScan(coords, times)
    ab_scan.geographic_radius_limit = 1000
    ab_scan.geographic_population_limit = 0.5
    ab_scan.time_max_interval = 200
    ab_scan.time_population_limit = 0.5
    return ab_scan

def build_stscan_numpy(coords, times):
    stsn = open_cp.stscan2.STScanNumpy(coords, times)
    stsn.geographic_radius_limit = 1000
    stsn.geographic_population_limit = 0.5
    stsn.time_max_interval = 200
    stsn.time_population_limit = 0.5
    return stsn

def to_timed_points(coords, times):
    """Convert to days before 2017-04-01 and use `STSTrainerSlow`."""
    timestamps = (np.timedelta64(1,"D") / np.timedelta64(1,"s")) * times * np.timedelta64(1,"s")
    timestamps = np.datetime64("2017-04-01T00:00") - timestamps
    return open_cp.data.TimedPoints(timestamps, coords)

def build_trainer(coords, times):
    trainer = open_cp.stscan.STSTrainerSlow()
    trainer.data = to_timed_points(coords, times)
    trainer.time_max_interval = np.timedelta64(200,"D")
    trainer.time_population_limit = 0.5
    trainer.geographic_population_limit = 0.5
    trainer.geographic_radius_limit = 1000
    return trainer

def build_trainer_fast(coords, times):
    trainer = open_cp.stscan.STSTrainer()
    trainer.data = to_timed_points(coords, times)
    trainer.time_max_interval = np.timedelta64(200,"D")
    trainer.time_population_limit = 0.5
    trainer.geographic_population_limit = 0.5
    trainer.geographic_radius_limit = 1000
    return trainer
We find that most of the time we obtain the same clusters, but sometimes we do not. The discrepancies appear to come down to how events falling exactly on the boundary of a disc are treated, which we discuss in detail below.
From this point of view, obtaining perfect agreement with SaTScan seems an almost hopeless ideal!
In [4]:
coords, times = make_random_data()
ab_scan = build_ab_scan(coords, times)
all_clusters = list(ab_scan.find_all_clusters())
for c in all_clusters:
    print(c.centre, c.radius, c.time, c.statistic)
In [5]:
stsn = build_stscan_numpy(coords, times)
all_clusters = list(stsn.find_all_clusters())
for c in all_clusters:
    print(c.centre, c.radius, c.time, c.statistic)
In [6]:
trainer = build_trainer(coords, times)
result = trainer.predict(time=np.datetime64("2017-04-01T00:00"))
for c, t, s in zip(result.clusters, result.time_ranges, result.statistics):
    assert np.datetime64("2017-04-01T00:00") == t[1]
    t = (np.datetime64("2017-04-01T00:00") - t[0]) / np.timedelta64(1,"D")
    print(c, t, s)
In [7]:
trainer_fast = build_trainer_fast(coords, times)
result_fast = trainer_fast.predict(time=np.datetime64("2017-04-01T00:00"))
for c, t, s in zip(result_fast.clusters, result_fast.time_ranges, result_fast.statistics):
    assert np.datetime64("2017-04-01T00:00") == t[1]
    t = (np.datetime64("2017-04-01T00:00") - t[0]) / np.timedelta64(1,"D")
    print(c, t, s)
In [8]:
# Plots the old "trainer" implementation against the abstract numpy implementation
import matplotlib.patches
fig, ax = plt.subplots(figsize=(8,8))
ax.set(xlim=[-10,110], ylim=[-10,110])
for c in result.clusters:
    cir = matplotlib.patches.Circle(c.centre, c.radius, alpha=0.5)
    ax.add_patch(cir)
for c in all_clusters:
    cir = matplotlib.patches.Circle(c.centre, c.radius, alpha=0.5, color="red")
    ax.add_patch(cir)
In [9]:
%timeit( list(ab_scan.find_all_clusters()) )
In [10]:
%timeit( list(stsn.find_all_clusters()) )
In [11]:
%timeit( trainer.predict() )
In [12]:
%timeit( trainer_fast.predict() )
In [ ]:
#ab_scan.to_satscan("satscan_test2", 1000)
Where we have found quite different behaviour from SaTScan is in "boundary" behaviour. Consider the case when a disc's boundary (its circumference) contains more than one event. The STSTrainer
class always considers all events inside or on the edge of the disc. But SaTScan will sometimes consider the events inside the disc together with only some of the events on the boundary.
Notice in particular that we can expect this to happen a lot if the input data lies on a regular grid.
We try to replicate this behaviour in AbstractSTScan
by considering all possibilities of events on the boundary being counted or not. Unfortunately, we then seem to beat SaTScan at its own game: we consider too many subsets, and so find clusters which SaTScan does not.
The first example below shows AbstractSTScan
being more aggressive than SaTScan. The second example shows SaTScan indeed failing to consider all events in a disc, in which case it agrees with AbstractSTScan
.
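A minimal sketch of the boundary issue (the arrays here are our own toy data, not from the notebook's data sets): with events on a regular grid, a disc's circumference can pass through several events at once, and whether the counting uses `<=` or `<` changes the cluster's contents:

```python
import numpy as np

# Five events: four lie exactly on the circle of radius 10 about (10, 10),
# and one sits at the centre.
coords = np.array([[0.0, 20.0, 10.0, 10.0, 10.0],
                   [10.0, 10.0, 0.0, 20.0, 10.0]])
centre, radius = np.array([10.0, 10.0]), 10.0
distsq = np.sum((coords - centre[:, None])**2, axis=0)

inclusive = int(np.sum(distsq <= radius**2))  # boundary counted: 5 events
strict = int(np.sum(distsq < radius**2))      # interior only: 1 event
```

Between these two extremes lie many intermediate subsets (the interior events plus some of the boundary events), and it is this space of subsets which our boundary rule enumerates.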
In [26]:
def trainer_to_data(trainer):
    coords = trainer.data.coords
    times = (np.datetime64("2017-04-01T00:00") - trainer.data.timestamps) / np.timedelta64(1,"s")
    times /= (np.timedelta64(1,"D") / np.timedelta64(1,"s"))
    times = np.floor(times)
    return coords, times
np.testing.assert_array_almost_equal(trainer_to_data(trainer)[0], coords)
np.testing.assert_array_almost_equal(trainer_to_data(trainer)[1], times)
In [27]:
trainer = build_trainer(*make_random_data())
region = open_cp.RectangularRegion(xmin=0, ymin=0, xmax=100, ymax=100)
ab_scan = build_ab_scan( *trainer_to_data( trainer.grid_coords(region, grid_size=20) ) )
In [28]:
all_clusters = list(ab_scan.find_all_clusters())
for c in all_clusters:
    print(c.centre, c.radius, c.time, c.statistic)
In [ ]:
#ab_scan.to_satscan("satscan_test1", 1000)
In [29]:
def find_satscan_ids_for_mask(in_disc, time):
    in_disc &= ab_scan.timestamps <= time
    in_disc = set( (x,y) for x,y in ab_scan.coords[:,in_disc].T )
    return [i for i in satscan_data.geo if satscan_data.geo[i] in in_disc]

def find_mask(centre, radius):
    return np.sum((ab_scan.coords - np.array(centre)[:,None])**2, axis=0) <= radius**2

def to_our_indexes(sat_scan_indexes):
    out = set()
    for i in sat_scan_indexes:
        x, y = satscan_data.geo[i]
        m = (ab_scan.coords[0] == x) & (ab_scan.coords[1] == y)
        for j in np.arange(ab_scan.coords.shape[1])[m]:
            out.add(j)
    return out
In [30]:
satscan_data = open_cp.stscan2.SaTScanData("satscan_test3", 1000)
ab_scan = build_ab_scan( *satscan_data.to_coords_time() )
all_clusters = list(ab_scan.find_all_clusters())
for c in all_clusters:
    print(c.centre, c.radius, c.time, c.statistic)
In [31]:
# Cluster which SaTScan finds -- In this case, seemingly SaTScan includes all events
in_disc = find_mask([30,30], 20)
find_satscan_ids_for_mask(in_disc, 70)
Out[31]:
In [32]:
# Our cluster -- all events in or on the disc
in_disc = find_mask([50,30], 20)
find_satscan_ids_for_mask(in_disc, 45)
Out[32]:
In [33]:
# The subset of events our algorithm chooses to use
in_disc = all_clusters[0].mask
find_satscan_ids_for_mask(in_disc, 45)
Out[33]:
In [34]:
# The numpy code should, mostly, replicate what SaTScan does
stsn = build_stscan_numpy( *satscan_data.to_coords_time() )
all_clusters = list(stsn.find_all_clusters())
for c in all_clusters:
    print(c.centre, c.radius, c.time, c.statistic)
This example actually seems to show SaTScan not including all points in a disc. SaTScan reports:
1.Location IDs included.: 23, 6, 16
Coordinates / radius..: (30,30) / 20.00
Time frame............: 993 to 1000
Number of cases.......: 3
Expected cases........: 0.42
Observed / expected...: 7.14
Test statistic........: 3.352053
P-value...............: 0.202
Recurrence interval...: 5.0 units
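As a quick sanity check, SaTScan's reported numbers are internally consistent with the usual space-time permutation log-likelihood ratio, n·log(n/μ) + (N−n)·log((N−n)/(N−μ)). The helper below is our own sketch, and we assume N = 100 events in total (the size of our generated data sets); with n = 3 and μ = 0.42 it reproduces the reported statistic to about four decimal places, the tiny residual being consistent with SaTScan rounding the expected count in its report.

```python
import numpy as np

def llr(n, mu, N):
    # Space-time permutation scan statistic: log-likelihood ratio of an
    # observed count n against an expected count mu, out of N events in total.
    if n <= mu:
        return 0.0
    return n * np.log(n / mu) + (N - n) * np.log((N - n) / (N - mu))

stat = llr(3, 0.42, 100)  # approximately 3.3521, versus SaTScan's 3.352053
```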
Now, we note that the disc SaTScan reports actually contains more events than the three location IDs it lists; we verify this below by recomputing the statistic.
In [35]:
satscan_data = open_cp.stscan2.SaTScanData("satscan_test1", 1000)
coords, times = satscan_data.to_coords_time()
ab_scan = build_ab_scan(coords, times)
all_clusters = list(ab_scan.find_all_clusters())
for c in all_clusters:
    print(c.centre, c.radius, c.time, c.statistic)
In [36]:
in_disc = find_mask([30,30], 20)
find_satscan_ids_for_mask(in_disc,7)
Out[36]:
In [37]:
in_disc = find_mask([30,30], 20)
find_satscan_ids_for_mask(in_disc,10000)
Out[37]:
In [38]:
satscan_data.geo[6], satscan_data.geo[16], satscan_data.geo[11], satscan_data.geo[23], satscan_data.geo[24]
Out[38]:
In [39]:
time_mask = times <= 7
space_mask = np.sum( (coords - np.array([30,30])[:,None])**2, axis=0) <= 20**2
expected = np.sum(space_mask) * np.sum(time_mask) / 100
actual = np.sum(space_mask & time_mask)
actual, expected, ab_scan._statistic(actual, expected, 100)
Out[39]:
In [40]:
# The above statistic is smaller than the one SaTScan finds, because the expected count is too large.
# But if we limit the spatial region to the ids SaTScan claims, we obtain a perfect match.
expected = len(to_our_indexes([23, 6, 16])) * np.sum(time_mask) / 100
expected
Out[40]:
In [41]:
# The numpy accelerated code doesn't find the same clusters
stsn = build_stscan_numpy(coords, times)
all_clusters = list(stsn.find_all_clusters())
for c in all_clusters:
    print(c.centre, c.radius, c.time, c.statistic)